Introduction

The aim of this project….

This project used data from 2930 residential property sales in Ames, Iowa between 2006 and 2010. The data set has 82 variables, comprising nominal, ordinal, discrete, and continuous attributes. Continuous variables describe area dimensions of the house and property, such as the size of the lot and garage. Discrete variables quantify characteristics of the house and property, such as the number of kitchens, baths, bedrooms, and parking spots. Nominal variables generally describe types of materials and locations, such as the name of the neighborhood or the type of foundation. Ordinal variables typically rate the condition and quality of house characteristics and utilities.

Exploratory Data Analysis

Before the exploratory data analysis, we hypothesized that the following variables would be the most predictive of home price: lot area, home type, year built, and overall quality. We chose these because they are among the top criteria we would consider ourselves if we were in the market for a home.

We also hypothesized that a generalized additive model (GAM) would be the best model to use, because a GAM can combine the strengths of several other model types, including polynomials, cubic splines, and smoothing splines.

Exploring Selected Home Characteristics in the Dataset

Sale Price graph

Lot area has many outliers in this dataset, as shown above. We found 127 outliers above the minimum outlier value of 17755 square feet. Because they made visualization difficult, we temporarily removed them. With the outliers removed, lot area is roughly normally distributed around the median of 9436.5 square feet.

From Figure 3, we see that 1-story homes built in 1946 or later make up a large share of our dataset: 1079 homes. This is over one-third of the dataset's 2930 observations. Please note that the graphs are interactive, so move your cursor over a graph to see more details. We can also observe from Figure 4 that most homes were built within a five-year range around 2005.
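The report does not state which rule flagged the outliers. A common choice, and the one boxplots use by default, is Tukey's upper fence: the third quartile plus 1.5 times the interquartile range. The Python sketch below assumes that rule. The lot areas are made up and the function names are hypothetical; this is not the project's own code.

```python
import statistics


def upper_fence(values):
    """Return Q3 + 1.5 * IQR, the usual boxplot cutoff for high outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


def split_outliers(values):
    """Split values into (kept, outliers, fence) using the upper fence only."""
    fence = upper_fence(values)
    kept = [v for v in values if v <= fence]
    outliers = [v for v in values if v > fence]
    return kept, outliers, fence


# Made-up lot areas in square feet (assumption: NOT taken from the Ames data).
lot_areas = [7200, 8400, 9000, 9500, 10200, 11000, 12500, 45000]
kept, outliers, fence = split_outliers(lot_areas)
print(f"fence={fence:.1f}, kept={len(kept)}, outliers={outliers}")
```

Filtering on the fence only for plotting, as the report does, leaves the full data available for modeling.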

Summary Statistics

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   3.000   3.511   4.000   5.000

Relationship Between Sale Price and Selected Characteristics

We can observe from Figure 5 that sale price varies widely across neighborhoods. Prices also vary within each neighborhood. Investigating some housing characteristics may give us insight into this within-neighborhood variation.

We first examined overall quality (Figure 6) and, as expected, price increases as overall quality increases. Examining year built (Figure 7), we observe that the newer a home is, the higher its price, on average.

In addition to investigating the relationship between sale price and location, overall quality, and age of the house, we also examined the relationship between sale price and home type. We find that 2-story homes built in 1946 or later have the highest median home prices (Figure 8).

Figure 9 explores the relationship between kitchen quality and sale price. The higher the kitchen quality, the higher the median sale price. This increase, however, is not linear; it appears to be quadratic. From Figure 10, we can see that, as expected, there is a gradual positive relationship between lot area and sale price.

Methodology

Sale Price:

Missing data:

Modifying variable class: We decided to keep the selected quality variables continuous rather than converting them to factors. Converting them would have led us to drop the “Very Poor” (“1”) factor level, which has only around 4 observations. By keeping the variables continuous, we retain these observations and can better predict the prices of homes in this category.

Model Selection:

We began our model selection by reducing the number of variables in our housing data set. We created a subset that included the variables we hypothesized would be important predictors of sale price.

These variables include:

We further included additional variables that will be utilized later in the report to create a renovation calculator.

Using our subset, we ran (1) subset selection, (2) forward stepwise selection, and (3) backward stepwise selection for our variable selection. The graphs below plot the number of variables against the BIC value for each of the three methods.

Across all three variable selection methods, the model with 7 variables has the lowest BIC score. Comparing the seven-variable models across the three methods, we see that they all share the same variables.

Subset Selection: (Intercept), tot_rms_abv_grd, overall_qual, lot_area, Bsmt.Qual, Kitchen.Qual, NeighborhoodNorthridge, BsmtFin.Type.1Unf

Forward Stepwise Selection: (Intercept), tot_rms_abv_grd, overall_qual, lot_area, Bsmt.Qual, Kitchen.Qual, NeighborhoodNorthridge, BsmtFin.Type.1Unf

Backward Stepwise Selection: (Intercept), tot_rms_abv_grd, overall_qual, lot_area, Bsmt.Qual, Kitchen.Qual, NeighborhoodNorthridge, BsmtFin.Type.1Unf
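To make the BIC comparison concrete, the Python sketch below scores candidate model sizes with one common form of BIC for a least-squares fit, n ln(RSS/n) + k ln(n), and picks the smallest score. This is an illustration only: the residual sums of squares are made up, the function names are hypothetical, and the selection software used for the report may print BIC on a different scale (which would shift the values but not necessarily the ranking shown here).

```python
import math


def gaussian_bic(rss, n_obs, n_params):
    """BIC for a least-squares fit, up to an additive constant."""
    return n_obs * math.log(rss / n_obs) + n_params * math.log(n_obs)


def best_size(rss_by_size, n_obs):
    """Return (size with lowest BIC, all scores).

    Keys of rss_by_size count predictors; one extra parameter is
    added for the intercept.
    """
    scores = {k: gaussian_bic(rss, n_obs, k + 1) for k, rss in rss_by_size.items()}
    return min(scores, key=scores.get), scores


# Made-up residual sums of squares (assumption: NOT the report's values).
rss_by_size = {5: 1300.0, 6: 1220.0, 7: 1150.0, 8: 1148.0, 9: 1147.0}
best, scores = best_size(rss_by_size, n_obs=500)
print(best, {k: round(v, 1) for k, v in scores.items()})
```

The k ln(n) term is what stops the larger models from winning: past a certain size, the small drop in RSS no longer pays for the extra parameter.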

We followed our variable selection analysis with cross-validation, producing 10-fold CV error estimates for polynomial regression, cubic splines, and smoothing splines.
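As a rough illustration of how a 10-fold CV error estimate is computed, the Python sketch below compares polynomial degrees on synthetic data. It covers only the polynomial case (not the splines), fits by solving the normal equations directly, and assumes a predictor on a small scale; a raw variable such as lot area would need centering and scaling first. All names and data here are hypothetical.

```python
import random


def fit_poly(xs, ys, degree):
    """Least-squares polynomial coefficients, lowest power first."""
    m = degree + 1
    # Augmented normal-equation matrix [X'X | X'y].
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)]
         + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(m):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [u - f * v for u, v in zip(a[r], a[col])]
    return [a[i][m] / a[i][i] for i in range(m)]


def predict(coefs, x):
    return sum(c * x ** p for p, c in enumerate(coefs))


def cv_mse(xs, ys, degree, k=10, seed=0):
    """Mean of the k held-out mean squared errors."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    fold_errors = []
    for fold in folds:
        held = set(fold)
        train = [i for i in idx if i not in held]
        coefs = fit_poly([xs[i] for i in train], [ys[i] for i in train], degree)
        sq = [(ys[i] - predict(coefs, xs[i])) ** 2 for i in fold]
        fold_errors.append(sum(sq) / len(sq))
    return sum(fold_errors) / k


# Synthetic data with a quadratic trend (assumption: NOT the Ames data).
rng = random.Random(1)
xs = [rng.uniform(-2, 2) for _ in range(200)]
ys = [1.0 + 2.0 * x - 1.5 * x * x + rng.gauss(0, 0.3) for x in xs]
errors = {d: cv_mse(xs, ys, d) for d in (1, 2, 3)}
print({d: round(e, 3) for d, e in errors.items()})
```

Using the same seed for every degree keeps the folds identical across models, so differences in CV error reflect the models rather than the split.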

A smoothing spline with 2 degrees of freedom appears to be the best model choice for lot area. It has the lowest CV error and the most stable curve.

A smoothing spline with 6 degrees of freedom appears to be the best fit for the total rooms above grade variable. While a cubic spline with fewer degrees of freedom is comparable, the cubic spline becomes less stable as the degrees of freedom increase.

A smoothing spline with 6 degrees of freedom appears to be a good fit for overall quality; however, other models perform comparably well.

A quadratic polynomial appears to be the best fit for kitchen quality, as it has the lowest CV error.

A cubic spline with 8 degrees of freedom appears to be the best model for year built. Other models are close in CV error and are fairly stable, but the cubic spline model has the lowest error.

A degree-three polynomial appears to be the best option for basement quality, as it has the lowest CV error. The cubic spline has only one point, so it is unclear whether its trend is stable.

## [1] 974957404
## 
## Call: gam(formula = saleprice ~ s(lot_area, 2) + s(tot_rms_abv_grd, 
##     6) + s(overall_qual, 6) + poly(Kitchen.Qual, 2) + bs(year_built, 
##     8) + poly(Bsmt.Qual, 3) + Neighborhood + full_bath_abv_grd + 
##     Roof.Style + BsmtFin.Type.1, data = training)
## Deviance Residuals:
##       Min        1Q    Median        3Q       Max 
## -307855.7  -15110.9    -602.1   13204.9  229251.2 
## 
## (Dispersion Parameter for gaussian family taken to be 1003401562)
## 
##     Null Deviance: 14893558370411 on 2327 degrees of freedom
## Residual Deviance: 2269694308185 on 2262 degrees of freedom
## AIC: 54925.29 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                         Df        Sum Sq       Mean Sq  F value
## s(lot_area, 2)           1  867039361341  867039361341  864.100
## s(tot_rms_abv_grd, 6)    1 3083264528360 3083264528360 3072.812
## s(overall_qual, 6)       1 6019444199776 6019444199776 5999.038
## poly(Kitchen.Qual, 2)    2  356665791154  178332895577  177.728
## bs(year_built, 8)        8  240713961304   30089245163   29.987
## poly(Bsmt.Qual, 3)       3  185458199658   61819399886   61.610
## Neighborhood            27  394673946803   14617553585   14.568
## full_bath_abv_grd        1   45638898787   45638898787   45.484
## Roof.Style               5    8518758484    1703751697    1.698
## BsmtFin.Type.1           5  108739787663   21747957533   21.674
## Residuals             2262 2269694308185    1003401562         
##                                      Pr(>F)    
## s(lot_area, 2)        < 0.00000000000000022 ***
## s(tot_rms_abv_grd, 6) < 0.00000000000000022 ***
## s(overall_qual, 6)    < 0.00000000000000022 ***
## poly(Kitchen.Qual, 2) < 0.00000000000000022 ***
## bs(year_built, 8)     < 0.00000000000000022 ***
## poly(Bsmt.Qual, 3)    < 0.00000000000000022 ***
## Neighborhood          < 0.00000000000000022 ***
## full_bath_abv_grd          0.00000000001947 ***
## Roof.Style                           0.1317    
## BsmtFin.Type.1        < 0.00000000000000022 ***
## Residuals                                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                       Npar Df Npar F                 Pr(F)    
## (Intercept)                                                   
## s(lot_area, 2)              1 80.448 < 0.00000000000000022 ***
## s(tot_rms_abv_grd, 6)       5 22.795 < 0.00000000000000022 ***
## s(overall_qual, 6)          5 37.108 < 0.00000000000000022 ***
## poly(Kitchen.Qual, 2)                                         
## bs(year_built, 8)                                             
## poly(Bsmt.Qual, 3)                                            
## Neighborhood                                                  
## full_bath_abv_grd                                             
## Roof.Style                                                    
## BsmtFin.Type.1                                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] 31338.57
## [1] 21132.8
## [1] 32828.06
## [1] 23224.5
##   tot_rms_abv_grd overall_qual lot_area year_built Bsmt.Qual Kitchen.Qual
## 1               2            6    10000       2005        90            3
##   Neighborhood full_bath_abv_grd Roof.Style BsmtFin.Type.1
## 1     Somerset                 2        Hip            ALQ
##     full_bath_abv_grd tot_rms_abv_grd            home_type overall_qual
## 254                 2               5 1-STORY 1945 & OLDER            5
##     lot_area year_built Garage.Qual Exterior.1st   Foundation Bsmt.Qual
## 254     4853       1924           3 Metal Siding Brick & Tile        80
##     Heating.QC Roof.Style Kitchen.Qual                       Neighborhood
## 254          2      Gable            2 South & West Iowa State University
##               Fence Street   Land.Slope Yr.Sold saleprice BsmtFin.Type.1
## 254 Minimum Privacy  Paved Gentle Slope    2010    104000            Rec
##      254 
## 109580.7
##     full_bath_abv_grd tot_rms_abv_grd            home_type overall_qual
## 254                 3               5 1-STORY 1945 & OLDER            5
##     lot_area year_built Garage.Qual Exterior.1st   Foundation Bsmt.Qual
## 254     4853       1924           3 Metal Siding Brick & Tile        80
##     Heating.QC Roof.Style Kitchen.Qual                       Neighborhood
## 254          2      Gable            2 South & West Iowa State University
##               Fence Street   Land.Slope Yr.Sold saleprice BsmtFin.Type.1
## 254 Minimum Privacy  Paved Gentle Slope    2010    104000            Rec
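The two printed rows for home 254 differ only in full_bath_abv_grd (2 versus 3), which suggests the renovation calculator compares model predictions before and after a change. The Python sketch below shows that idea under that assumption. The `toy_predict` function and its coefficients are invented stand-ins for a fitted model, not the GAM above, and `renovation_gain` is a hypothetical name.

```python
def renovation_gain(predict_price, home, changes):
    """Predicted price change from applying `changes` to a copy of `home`."""
    renovated = {**home, **changes}
    return predict_price(renovated) - predict_price(home)


def toy_predict(home):
    # Invented linear stand-in for a fitted model (assumption, not the GAM).
    return 50_000 + 12_000 * home["full_bath_abv_grd"] + 8_000 * home["overall_qual"]


home = {"full_bath_abv_grd": 2, "overall_qual": 5}
gain = renovation_gain(toy_predict, home, {"full_bath_abv_grd": 3})
print(gain)
```

Because the home is copied rather than modified, several candidate renovations can be compared against the same baseline.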

Analysis

Discussion

References